DS 5001 Project Notebook: Greek and Roman Mythology

The code for processing the files and working with them is available in the eta_modules folder, which separates most of the implementation details from the presentation of results. I also show additional relevant functionality beyond the pure results to give a better indication of the process of arriving at them.

Reading in the data

We start by loading in the XML files for each work and parsing them to a reasonable degree with BeautifulSoup and NLTK.
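As a rough sketch of this parsing step (the inline TEI-style snippet and its element names are illustrative stand-ins for an actual Perseus file, and I use Python's built-in html.parser here so the sketch runs without lxml):

```python
from bs4 import BeautifulSoup

# Stand-in for a Perseus XML file: a TEI-like snippet with one top-level
# <div> (a "book") containing verse lines. Element names are illustrative.
xml = """
<TEI>
  <text>
    <body>
      <div type="book" n="1">
        <l>Sing, O goddess, the anger of Achilles</l>
        <l>son of Peleus, that brought countless ills</l>
      </div>
    </body>
  </text>
</TEI>
"""

# html.parser avoids the lxml dependency; the notebook itself parses XML.
soup = BeautifulSoup(xml, "html.parser")
divs = soup.find_all("div")                            # top-level divisions
lines = [l.get_text() for l in divs[0].find_all("l")]  # verse lines in div 1
```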

Since these works are all either plays or poems/epics, the concept of a "chapter" or "paragraph" doesn't translate perfectly compared to, e.g., a novel. However, the Perseus Digital Library (where these files are sourced from) has added at least top-level divisions to break up texts. In some cases, these divisions truly exist in the text (for example, The Iliad is broken into 24 books); in other cases, like plays, these divisions don't seem to be directly present in the text, but are akin to something like a "scene". I've considered all of these largest divisions as "chapters".

To get at something like a "paragraph", I used a different approach based on whether the work was a play or not:

I've included code/functions that are able to either parse the XML files from their initial state or load in pre-computed Corpus tables to speed up notebook computations.
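The parse-or-load pattern looks roughly like this (the function name, table name, and CSV caching scheme are hypothetical, not the notebook's actual API):

```python
import os
import tempfile

import pandas as pd

def load_corpus_table(csv_path, parse_fn):
    """Load a precomputed table if it exists; otherwise parse and cache it."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path, index_col=0)
    df = parse_fn()
    df.to_csv(csv_path)
    return df

# Usage with a trivial stand-in parser that counts how often it runs
calls = {"n": 0}
def fake_parse():
    calls["n"] += 1
    return pd.DataFrame({"term_str": ["sing", "goddess"]})

path = os.path.join(tempfile.mkdtemp(), "TOKEN.csv")
df1 = load_corpus_table(path, fake_parse)  # no cache yet: parses and saves
df2 = load_corpus_table(path, fake_parse)  # cache hit: parse_fn not called
```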

Term frequency methods used in this notebook

A chapter-level bag tends to surface more significant nouns, while a paragraph-level (or smaller) bag tends to make terms like pronouns appear significant.
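A minimal sketch of what "bag level" means computationally, using a long-format token table (the column names work_id, chap_num, para_num, and term_str are illustrative, not necessarily the notebook's):

```python
import pandas as pd

# Toy long-format TOKEN table: one row per token occurrence
TOKEN = pd.DataFrame({
    "work_id":  ["iliad"] * 6,
    "chap_num": [1, 1, 1, 1, 2, 2],
    "para_num": [1, 1, 2, 2, 1, 1],
    "term_str": ["wrath", "achilles", "he", "wrath", "ships", "he"],
})

def term_freq(tokens, bag):
    """Count term occurrences within each bag (a list of grouping columns)."""
    return tokens.groupby(bag + ["term_str"]).size()

# The same tokens counted at two different bag levels
chap_tf = term_freq(TOKEN, ["work_id", "chap_num"])
para_tf = term_freq(TOKEN, ["work_id", "chap_num", "para_num"])
```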

Hierarchical Clustering based on TF-IDF (max normalization)

Basic process:

Dendrograms

Overall, both metrics lead to fairly good segmentation, both by author and between plays (Aeschylus, Sophocles, and Euripides) vs. non-plays. Additionally, with both metrics, both of the Roman works (by Vergil and Ovid) have a high degree of similarity to each other, and they are also grouped together with both of Homer's works, which likely indicates some artistic influence. The Aeneid certainly takes inspiration from the Iliad and the Odyssey.
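The clustering step can be sketched as follows with scipy; the TF-IDF matrix here is random stand-in data, the work labels are for illustration, and cosine/complete are just one distance-metric/linkage pairing of the ones compared above:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

# Stand-in TF-IDF matrix: one row per work, one column per term
rng = np.random.default_rng(0)
tfidf = rng.random((5, 20))
labels = ["Iliad", "Odyssey", "Aeneid", "Metamorphoses", "Medea"]

# Cosine distance pairs naturally with TF-IDF vectors; 'complete' linkage
# is one reasonable agglomeration criterion among several.
dist = pdist(tfidf, metric="cosine")
Z = linkage(dist, method="complete")

# no_plot=True returns the tree structure; drop it to draw the dendrogram
tree = dendrogram(Z, labels=labels, no_plot=True)
```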

Principal Components Analysis using TF-IDF covariance matrix

Basic process:

PCA plots
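The covariance-based PCA computation can be sketched directly in numpy (the matrix here is random stand-in data, not the actual TF-IDF table):

```python
import numpy as np

# Stand-in TF-IDF matrix: 10 documents x 6 terms
rng = np.random.default_rng(1)
X = rng.random((10, 6))

Xc = X - X.mean(axis=0)                  # center each term column
cov = np.cov(Xc, rowvar=False)           # term-by-term covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: covariance is symmetric
order = np.argsort(eigvals)[::-1]        # sort components by explained variance
components = eigvecs[:, order[:2]]       # loadings for the top 2 components
scores = Xc @ components                 # document coordinates (PC1, PC2)
```

The `scores` array is what gets plotted, with each document placed by its first two principal-component coordinates.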

Topic Models (LDA)

Topic models can provide an interpretable, high-level model of the patterns and themes in documents. The main outputs of a Latent Dirichlet Allocation (LDA) model are the $\theta$ (document-topic) table and the $\phi$ (topic-word) table.

Some additional details are available from the scikit-learn documentation or Wikipedia.
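Fitting LDA and recovering the two tables can be sketched with scikit-learn; the tiny corpus and the choice of 2 topics are stand-ins for the notebook's actual inputs:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the chapter- or paragraph-level bags
docs = [
    "wrath of achilles sing goddess",
    "sing muse of the man of many ways",
    "arms and the man i sing",
]
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)

# theta: document-topic distributions (each row sums to 1)
theta = lda.fit_transform(counts)

# phi: topic-word distributions, from normalizing the fitted pseudo-counts
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
```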

Top words

The dataframe below shows how often particular words appear among the top 10 words associated with each topic. There are 400 words in total (10 words per topic x 40 topics), so the p column is n/400.
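The tally can be computed along these lines (the top-word lists here are invented, with 3 words per topic instead of 10, just to show the shape of the computation):

```python
import pandas as pd

# Invented top-word lists: topic id -> its top words
top_words = {
    0: ["love", "war", "sea"],
    1: ["war", "gods", "sea"],
    2: ["war", "ships", "love"],
}

# Flatten all top-word lists and count each word's appearances (n),
# then express that as a share of all top-word slots (p)
counts = pd.Series(
    [w for words in top_words.values() for w in words]
).value_counts()
total = sum(len(words) for words in top_words.values())
word_table = counts.to_frame("n").assign(p=lambda d: d["n"] / total)
```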

Topic weights

Topic weights are calculated as the sum of each topic's column in the $\theta$ table.
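That computation is a one-liner over $\theta$ (the values and topic labels below are made up):

```python
import pandas as pd

# Stand-in theta table: one row per document, one column per topic
theta = pd.DataFrame(
    [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]],
    index=["doc1", "doc2", "doc3"],
    columns=["T0", "T1"],
)

# Topic weight = the sum of that topic's column, sorted for display
topic_weights = theta.sum(axis=0).sort_values(ascending=False)
```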

Authors and Topics

In this table, you can see which topics are most highly associated with each author. Darker cells indicate topics that are more highly associated with a given author. For example, Vergil (who has a single work in this corpus, the Aeneid) is highly associated with the topics whose top terms are "love, son, arms, blood, death, life, words, waves, sea, eyes" and "war, hand, sword, arms, foe, gods, way, thee, shield, spear". Given their roughly equal weights, they could represent the two main sections of the Aeneid: the journey by sea to Italy after the end of the Trojan War, and the battles that took place in Italy after the Trojans' arrival.

Word Embeddings (word2vec)

Word embedding algorithms produce representative vectors for words based on word co-occurrence statistics in various contexts. So, words which appear in similar contexts (are surrounded by similar words) should have similar vectors.

2-dimensional representation of word vectors

Looking through these, it's possible to notice how some related words can cluster together:

Word analogies

It's also possible to add and subtract word vectors to produce "word analogies", which can sometimes yield interesting results.
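The arithmetic behind an analogy query is just vector addition plus a nearest-neighbor lookup by cosine similarity. A sketch with tiny hand-made 2-d vectors (in practice gensim's most_similar does this over the trained vectors):

```python
import numpy as np

# Hand-made vectors chosen so the classic analogy works out exactly
vocab = {
    "king":  np.array([1.0, 1.0]),
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([0.0, 0.0]),
    "queen": np.array([0.0, 1.0]),
}

# v("king") - v("man") + v("woman") should land near v("queen")
target = vocab["king"] - vocab["man"] + vocab["woman"]

def nearest(v, exclude):
    """Return the vocab word with the highest cosine similarity to v."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    return max((w for w in vocab if w not in exclude),
               key=lambda w: cos(vocab[w], v))

answer = nearest(target, exclude={"king", "man", "woman"})  # -> "queen"
```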

Sentiment Analysis

The sentiment analysis performed here uses a lexicon-based approach (NRC Emotion Lexicon).
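The mechanics of lexicon-based scoring are a join and an average; a sketch with invented lexicon rows (the actual NRC lexicon has different entries, and the column names here are illustrative):

```python
import pandas as pd

# Invented stand-in for the NRC lexicon: one row per (term, emotion) pair,
# pivoted so each term has a 0/1 column per emotion
lexicon = pd.DataFrame({
    "term_str": ["wrath", "wrath", "love", "death"],
    "emotion":  ["anger", "negative", "joy", "sadness"],
    "value":    [1, 1, 1, 1],
}).pivot_table(index="term_str", columns="emotion", values="value",
               fill_value=0)

# Toy token table to score
TOKEN = pd.DataFrame({
    "chap_num": [1, 1, 2],
    "term_str": ["wrath", "love", "death"],
})

# Join each token to its emotion scores, then average per chapter;
# tokens missing from the lexicon contribute zeros
scored = TOKEN.join(lexicon, on="term_str").fillna(0)
chapter_sentiment = scored.groupby("chap_num")[lexicon.columns.tolist()].mean()
```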

Average sentiments by author or work

Chapter table

Cells here are colored based on sentiment value in each chapter. Darker colors indicate more intense emotion for a particular column.

Saving Tables